When it comes to the world of businesses, having good reviews equals good economical flow in most cases, but just how influential can reviews really be? For our project, we are looking at the effects of positive and negative reviews on businesses using Yelp’s review dataset. With that in mind, we also want to show how successful a business can get based off of initial reviews.
For this project, the main coding language will be Python. Since we are using the over 8 million reviews as a baseline to solve our problem, we will be using Spark so we can be flexible in the ways we download/upload the data from yelp to our iPynb.
business = ss.read.json("data/yelp_academic_dataset_business.json")
checkin = ss.read.json("data/yelp_academic_dataset_checkin.json")
review = ss.read.json("data/yelp_academic_dataset_review.json")
tip = ss.read.json("data/yelp_academic_dataset_tip.json")
user = ss.read.json("data/yelp_academic_dataset_user.json")
21/12/16 06:45:03 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
business.printSchema()
print(business.count())
root |-- address: string (nullable = true) |-- attributes: struct (nullable = true) | |-- AcceptsInsurance: string (nullable = true) | |-- AgesAllowed: string (nullable = true) | |-- Alcohol: string (nullable = true) | |-- Ambience: string (nullable = true) | |-- BYOB: string (nullable = true) | |-- BYOBCorkage: string (nullable = true) | |-- BestNights: string (nullable = true) | |-- BikeParking: string (nullable = true) | |-- BusinessAcceptsBitcoin: string (nullable = true) | |-- BusinessAcceptsCreditCards: string (nullable = true) | |-- BusinessParking: string (nullable = true) | |-- ByAppointmentOnly: string (nullable = true) | |-- Caters: string (nullable = true) | |-- CoatCheck: string (nullable = true) | |-- Corkage: string (nullable = true) | |-- DietaryRestrictions: string (nullable = true) | |-- DogsAllowed: string (nullable = true) | |-- DriveThru: string (nullable = true) | |-- GoodForDancing: string (nullable = true) | |-- GoodForKids: string (nullable = true) | |-- GoodForMeal: string (nullable = true) | |-- HairSpecializesIn: string (nullable = true) | |-- HappyHour: string (nullable = true) | |-- HasTV: string (nullable = true) | |-- Music: string (nullable = true) | |-- NoiseLevel: string (nullable = true) | |-- Open24Hours: string (nullable = true) | |-- OutdoorSeating: string (nullable = true) | |-- RestaurantsAttire: string (nullable = true) | |-- RestaurantsCounterService: string (nullable = true) | |-- RestaurantsDelivery: string (nullable = true) | |-- RestaurantsGoodForGroups: string (nullable = true) | |-- RestaurantsPriceRange2: string (nullable = true) | |-- RestaurantsReservations: string (nullable = true) | |-- RestaurantsTableService: string (nullable = true) | |-- RestaurantsTakeOut: string (nullable = true) | |-- Smoking: string (nullable = true) | |-- WheelchairAccessible: string (nullable = true) | |-- WiFi: string (nullable = true) |-- business_id: string (nullable = true) |-- categories: string (nullable = true) |-- city: string (nullable = true) |-- hours: struct (nullable = true) | |-- Friday: string (nullable = true) | |-- Monday: string (nullable = true) | |-- Saturday: string (nullable = true) | |-- Sunday: string (nullable = true) | |-- Thursday: string (nullable = true) | |-- Tuesday: string (nullable = true) | |-- Wednesday: string (nullable = true) |-- is_open: long (nullable = true) |-- latitude: double (nullable = true) |-- longitude: double (nullable = true) |-- name: string (nullable = true) |-- postal_code: string (nullable = true) |-- review_count: long (nullable = true) |-- stars: double (nullable = true) |-- state: string (nullable = true)
[Stage 5:=======> (1 + 7) / 8]
160585
checkin.printSchema()
print(checkin.count())
root |-- business_id: string (nullable = true) |-- date: string (nullable = true)
[Stage 7:====================================> (5 + 3) / 8]
138876
review.printSchema()
print(review.count())
root |-- business_id: string (nullable = true) |-- cool: long (nullable = true) |-- date: string (nullable = true) |-- funny: long (nullable = true) |-- review_id: string (nullable = true) |-- stars: double (nullable = true) |-- text: string (nullable = true) |-- useful: long (nullable = true) |-- user_id: string (nullable = true)
[Stage 9:======================================================> (50 + 2) / 52]
8635403
categories = (business
.select("business_id", "categories")
.withColumn("categories", F.explode(F.split(F.col("categories"), ", "))))
categories.show()
+--------------------+--------------------+ | business_id| categories| +--------------------+--------------------+ |6iYb2HFDywm3zjuRg...| Gastropubs| |6iYb2HFDywm3zjuRg...| Food| |6iYb2HFDywm3zjuRg...| Beer Gardens| |6iYb2HFDywm3zjuRg...| Restaurants| |6iYb2HFDywm3zjuRg...| Bars| |6iYb2HFDywm3zjuRg...|American (Traditi...| |6iYb2HFDywm3zjuRg...| Beer Bar| |6iYb2HFDywm3zjuRg...| Nightlife| |6iYb2HFDywm3zjuRg...| Breweries| |tCbdrRPZA0oiIYSmH...| Salad| |tCbdrRPZA0oiIYSmH...| Soup| |tCbdrRPZA0oiIYSmH...| Sandwiches| |tCbdrRPZA0oiIYSmH...| Delis| |tCbdrRPZA0oiIYSmH...| Restaurants| |tCbdrRPZA0oiIYSmH...| Cafes| |tCbdrRPZA0oiIYSmH...| Vegetarian| |bvN78flM8NLprQ1a1...| Antiques| |bvN78flM8NLprQ1a1...| Fashion| |bvN78flM8NLprQ1a1...| Used| |bvN78flM8NLprQ1a1...|Vintage & Consign...| +--------------------+--------------------+ only showing top 20 rows
categories_count = (categories
.groupBy("categories")
.count()
.sort("count", ascending=False))
categories_count.show()
[Stage 19:==================================================> (186 + 9) / 200]
+--------------------+-----+ | categories|count| +--------------------+-----+ | Restaurants|50763| | Food|29469| | Shopping|26205| | Beauty & Spas|16574| | Home Services|16465| | Health & Medical|15102| | Local Services|12192| | Nightlife|11990| | Bars|10741| | Automotive|10119| |Event Planning & ...| 9644| | Active Life| 9231| | Coffee & Tea| 7725| | Sandwiches| 7272| | Fashion| 6599| |American (Traditi...| 6541| | Hair Salons| 5900| | Pizza| 5756| | Hotels & Travel| 5703| | Breakfast & Brunch| 5505| +--------------------+-----+ only showing top 20 rows
categories_df = categories_count.limit(20).toPandas()
categories_histogram = px.histogram(categories_df, y="categories", x="count")
categories_histogram.show()
review_dates = (review
.join(business, "business_id")
.select("review_id", "business_id", "date")
.sort("business_id", "date", ascending=False)
.withColumn("month", F.split(F.col("date"), "-")[1])
.withColumn("year", F.split(F.col("date"), "-")[0]))
review_dates.show()
review_count_by_date = (review_dates
.select("month", "year")
.sort("year", "month", ascending=False))
review_count_by_date.groupBy("month").pivot("year").count().sort("month", ascending=True).show()
review_by_year = (review_count_by_date
.groupBy("year").count()
.sort("year"))
review_by_year.show()
+--------------------+--------------------+-------------------+-----+----+ | review_id| business_id| date|month|year| +--------------------+--------------------+-------------------+-----+----+ |RIlyZDUUL7dn-wX9R...|zzzKmD9Mj6WtJwJUh...|2012-01-05 07:25:19| 01|2012| |Beiu8FUpabvKp_tIj...|zzzKmD9Mj6WtJwJUh...|2011-07-06 03:11:19| 07|2011| |NsbPfBB7VcKavo2kP...|zzzKmD9Mj6WtJwJUh...|2011-01-30 07:26:01| 01|2011| |ZLg9JZBm8fti3kdcK...|zzzKmD9Mj6WtJwJUh...|2010-12-26 08:06:25| 12|2010| |_PDg4GKwqstcU4jqS...|zzzKmD9Mj6WtJwJUh...|2010-05-05 21:14:56| 05|2010| |7Me9k1V0KNyqAds43...|zzzKmD9Mj6WtJwJUh...|2010-03-07 20:52:10| 03|2010| |PEgXvQE2ZKGQZWqt5...|zzzKmD9Mj6WtJwJUh...|2009-10-22 19:53:49| 10|2009| |coFCcXcoVnTLwNMct...|zzzKmD9Mj6WtJwJUh...|2009-10-13 08:08:57| 10|2009| |79BNQZEWEcZjZe4G7...|zzxIO4bjWCQrvCJ0O...|2020-06-05 12:17:10| 06|2020| |DWbIic86a-H_8nau6...|zzxIO4bjWCQrvCJ0O...|2020-01-09 21:53:36| 01|2020| |6LVhP-V7QtsjN34Ao...|zzxIO4bjWCQrvCJ0O...|2019-08-15 22:00:46| 08|2019| |h5xb4XzKaFr_hBS1_...|zzxIO4bjWCQrvCJ0O...|2018-05-09 11:46:59| 05|2018| |64mGlTN9U8uh3qLQq...|zzxIO4bjWCQrvCJ0O...|2011-11-02 14:02:07| 11|2011| |K0DGtUbEh9mKHrLh4...|zzwK-TJsCJX5wZrdt...|2016-08-02 16:31:57| 08|2016| |uQzssdiURqnvdQVPd...|zzwK-TJsCJX5wZrdt...|2016-07-24 19:06:03| 07|2016| |xgfrMLg0mBFkbzZyN...|zzwK-TJsCJX5wZrdt...|2016-07-19 04:10:43| 07|2016| |7aAloTeX7KceHKivy...|zzwK-TJsCJX5wZrdt...|2016-07-19 01:17:27| 07|2016| |d6KrSTDkBcRmqBZMh...|zzwK-TJsCJX5wZrdt...|2016-07-18 19:54:03| 07|2016| |tseKwcVWB-PBSLzbK...|zzwK-TJsCJX5wZrdt...|2015-05-06 14:45:58| 05|2015| |nw1N6IVukv-Ez_JTL...|zzwK-TJsCJX5wZrdt...|2014-05-06 16:05:21| 05|2014| +--------------------+--------------------+-------------------+-----+----+ only showing top 20 rows
+-----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+-----+-----+-----+ |month|2004|2005|2006|2007| 2008| 2009| 2010| 2011| 2012| 2013| 2014| 2015| 2016| 2017| 2018| 2019| 2020| 2021| +-----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+-----+-----+-----+ | 01|null| 1| 863|4758|10197|17799|24147|35023|40899|44374|54145|70979|82320|80022| 85867|86871|76453|44461| | 02|null| 4|1267|4466| 9863|15605|23299|33158|37994|41225|49677|63486|78146|75541| 80386|81578|74612| null| | 03|null| 25|1497|4543|10755|17154|26245|37661|41296|44404|58557|76281|81878|87367| 94662|93600|50136| null| | 04|null| 48|1873|4557|10733|15891|24116|35072|37682|42412|56016|72023|80990|87630| 93447|86074|24036| null| | 05|null| 70|1754|6012|10903|16180|23697|35669|38974|44925|58678|78086|84232|91233| 94852|91534|33660| null| | 06|null| 502|1593|6576|11580|17062|25276|36004|39974|44714|61867|78012|83452|91644| 97191|93595|43657| null| | 07|null|2412|1546|7575|14964|19087|29174|40410|43052|49496|69241|85882|90116|99411|107107|97029|50353| null| | 08|null|2271|2028|6668|17138|21619|32210|41051|42557|53526|71823|89623|86628|93197|100206|94767|54419| null| | 09|null| 320|2101|6367|14139|18689|27601|35003|37909|48239|62774|76155|77204|83352| 86467|83076|51666| null| | 10| 38| 290|2514|6662|13553|19210|28034|35750|39256|49242|63055|77803|77082|85783| 84966|82121|52815| null| | 11| 2| 212|3308|6485|13646|18389|26970|33680|37203|46552|60272|70896|68240|77382| 77851|73277|44667| null| | 12| 12| 284|3475|7247|12965|17112|26814|32711|35645|46631|60014|68303|70239|76995| 81333|74047|45417| null| +-----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+-----+-----+-----+
[Stage 35:======================================================> (51 + 1) / 52]
+----+-------+ |year| count| +----+-------+ |2004| 52| |2005| 6439| |2006| 23819| |2007| 71916| |2008| 150436| |2009| 213797| |2010| 317583| |2011| 431192| |2012| 472441| |2013| 555740| |2014| 726119| |2015| 907529| |2016| 960527| |2017|1029557| |2018|1084335| |2019|1037569| |2020| 601891| |2021| 44461| +----+-------+
review_by_year_pd = review_by_year.toPandas()
fig = px.bar(review_by_year_pd, x="year", y="count", title="Total Reviews Per Year")
fig.show()
checkin_date = (checkin
.join(business, "business_id")
.withColumn("checkin_count", F.size(F.split(F.col("date"), ",")))
.select("business_id", "date", "checkin_count")
.sort("checkin_count", ascending=False)
.withColumn("month", F.split(F.col("date"), "-")[1])
.withColumn("year", F.split(F.col("date"), "-")[0]))
checkin_date.show()
checkin_counts_by_year = (checkin_date
.groupBy("year")
.sum("checkin_count")
.withColumnRenamed("sum(checkin_count)", "count")
.sort("year"))
checkin_counts_by_year.show()
+--------------------+--------------------+-------------+-----+----+ | business_id| date|checkin_count|month|year| +--------------------+--------------------+-------------+-----+----+ |7sxYa0-TwWeWGFr5C...|2010-01-17 02:00:...| 150005| 01|2010| |EqUqaLJxMDxIqvnma...|2010-01-16 04:58:...| 75511| 01|2010| |2PxZ-fICnd432NJHe...|2010-01-21 17:52:...| 63982| 01|2010| |UlndVD4tezU3FACjZ...|2010-01-17 13:19:...| 54657| 01|2010| |vITc1UkWLvG9pVdaO...|2010-01-18 23:59:...| 54361| 01|2010| |cyBm7p3D2RAoGlDn-...|2010-01-30 23:31:...| 25208| 01|2010| |NvAYQvjLxwGC-kcWz...|2010-01-16 17:37:...| 24017| 01|2010| |4CxF8c3MB7VAdY8zF...|2010-01-17 23:35:...| 20291| 01|2010| |OPfgKOm_n-ajUo3qj...|2010-01-18 05:12:...| 19941| 01|2010| |m3DeBd0NSbAGDjUOk...|2010-01-19 18:16:...| 18388| 01|2010| |BOFD5UmhHvCn_XH3_...|2010-02-07 02:27:...| 17033| 02|2010| |C_k727ws82eMe9xtJ...|2010-01-18 21:50:...| 16848| 01|2010| |OQ2oHkcWA8KNC1Lsv...|2010-02-13 17:19:...| 15946| 02|2010| |qeuJgUdcmL5yAweOs...|2010-01-16 22:32:...| 15451| 01|2010| |Tjgte2R4e-3JWj4aA...|2010-02-11 18:49:...| 15378| 02|2010| |zioLxtBc9THNS2TOn...|2010-01-17 04:40:...| 14934| 01|2010| |5y2zZGIE2a4tuEJ5F...|2010-01-16 02:57:...| 14153| 01|2010| |5xS76cdT7THG14H_W...|2010-02-21 14:55:...| 14151| 02|2010| |PrsvO1rzkgg6qFizl...|2010-01-18 18:40:...| 13360| 01|2010| |o_L9Ss4boqq6ZEF9x...|2012-04-14 19:15:...| 12202| 04|2012| +--------------------+--------------------+-------------+-----+----+ only showing top 20 rows
[Stage 46:====================================> (5 + 3) / 8]
+----+--------+ |year| count| +----+--------+ |2010|11717005| |2011| 1634183| |2012| 1255754| |2013| 980808| |2014| 902834| |2015| 756902| |2016| 585924| |2017| 392881| |2018| 259931| |2019| 128699| |2020| 26019| |2021| 352| +----+--------+
checkin_counts_by_year_pd = checkin_counts_by_year.toPandas()
fig = px.bar(checkin_counts_by_year_pd, x="year", y="count", title="Total Check-Ins Per Year")
fig.show()
user_reviews_location = (user
.join(review, 'user_id')
.join(business, 'business_id'))
user_reviews_location.select('user_id', 'business_id', 'latitude', 'longitude').show()
[Stage 56:======================================================> (51 + 1) / 52]
+--------------------+--------------------+-------------+--------------+ | user_id| business_id| latitude| longitude| +--------------------+--------------------+-------------+--------------+ |--hJsDxzXZURcLxaL...|Ln-8CbKGZGmF-GCqM...| 33.7787398| -84.409326| |--hJsDxzXZURcLxaL...|QS_ojFa7FpfOVThrs...| 33.923006| -84.3399804| |-0Ji0nOyFe-4yo8BK...|1hJEqPHe65oJmO6sc...| 30.3479823| -97.7353261| |-0KAIqdO-DhrIXcY9...|WyIk1ZWy31Qaa8xUi...| 30.0135114| -97.8635298| |-0KAIqdO-DhrIXcY9...|GtlDQ8GG8ykG1ion3...|30.0127452517|-97.8907751147| |-0KAIqdO-DhrIXcY9...|OUZIlIytYIiqPj2RF...| 30.0169045| -97.8633635| |-0KAIqdO-DhrIXcY9...|_Fs1js9sOEFDJ2Txg...|30.0116336043|-97.8900940009| |-0KAIqdO-DhrIXcY9...|k_JMC4M3BJ4X9pC5X...| 30.0708447| -97.873136| |-0KAIqdO-DhrIXcY9...|9BBnf7fLTs2EIyDFo...| 30.0280744| -97.8765501| |-0KAIqdO-DhrIXcY9...|4-zgy_r5j82FiX2ex...| 30.2288371| -97.788686| |-0KAIqdO-DhrIXcY9...|4-zgy_r5j82FiX2ex...| 30.2288371| -97.788686| |-0KAIqdO-DhrIXcY9...|KY-EXj--9PlSBGhOS...| 30.0130756| -97.8917627| |-0KAIqdO-DhrIXcY9...|N2lTmVjYrQrw1nl9M...| 30.1769379| -97.7904271| |-0atCwlr6SSpYniMw...|juSETXWf57AtH5rh2...| 30.2619868| -97.722894| |-1KKYzibGPyUX-Mwk...|ROa5tRU4lUn1ffu0H...| 33.757101775|-84.3881696716| |-1KKYzibGPyUX-Mwk...|SFqFFIA4Ks2oHfgEA...| 28.440381| -81.4698224| |-25JZ2VjrGZfXR8Ea...|S0QEoRla89uZyI1BC...|28.6069310793| -81.209671868| |-25JZ2VjrGZfXR8Ea...|t5EgwRXG6MZSihdqu...| 28.8114547| -81.2666383| |-25JZ2VjrGZfXR8Ea...|_OvpwIXRqRXkVjCgu...| 30.3954441| -97.7269882| |-25JZ2VjrGZfXR8Ea...|x6SoT6Zg8mrIx1uAi...| 30.2928986| -97.741776| +--------------------+--------------------+-------------+--------------+ only showing top 20 rows
review_counts = (business
.select("name", "latitude", "longitude", "review_count", "stars", "state")
.sort("stars", "review_count", ascending=False))
review_counts.show()
[Stage 15:=============================> (4 + 4) / 8]
+--------------------+-------------+--------------+------------+-----+-----+ | name| latitude| longitude|review_count|stars|state| +--------------------+-------------+--------------+------------+-----+-----+ |Powell's City of ...| 45.5230969| -122.681325| 4295| 5.0| OR| | Aviva by Kameel|33.7605381455|-84.3865554648| 1532| 5.0| GA| | Stanley Park| 49.2978842| -123.1308093| 939| 5.0| BC| |Ovation Coffee & Tea| 45.532588| -122.6810282| 740| 5.0| OR| | Pedal Bike Tours| 45.521572| -122.6729487| 736| 5.0| OR| | Ciao! Pizza & Pasta| 42.3892206| -71.0408823| 733| 5.0| MA| | 1618 Asian Fusion| 30.2453207| -97.7303193| 641| 5.0| TX| |Lake Travis Zipli...| 30.42903| -97.890062| 584| 5.0| TX| |Brakes To Go - Mo...| 30.2729209| -97.7443863| 577| 5.0| TX| | Jewboy Burgers| 30.3127408| -97.7150455| 516| 5.0| TX| | Swiss Hibiscus| 45.5587705| -122.651453| 510| 5.0| OR| |Peace Frog Specia...| 30.2648272| -97.7341597| 509| 5.0| TX| | Tōv| 45.512267| -122.6323559| 503| 5.0| OR| |Einstein Moving C...| 30.348606| -97.67326| 465| 5.0| TX| |Realty Austin - C...| 30.3154649| -97.7346205| 441| 5.0| TX| | Car Zone| 42.3748556| -71.0937018| 426| 5.0| MA| |Auto Glass Solutions| 30.380665| -97.7237| 411| 5.0| TX| |Always Available ...|30.2481061334|-97.7615793065| 411| 5.0| TX| |T-Loc's Sonora Ho...| 30.3211689| -97.7395055| 410| 5.0| TX| | Stairhopper Movers| 42.3796489| -71.0717834| 410| 5.0| MA| +--------------------+-------------+--------------+------------+-----+-----+ only showing top 20 rows
review_counts_pd = review_counts.toPandas()
fig = px.density_mapbox(review_counts_pd, lat='latitude', lon='longitude', z='review_count', title="Review Density Heatmap", radius=10, zoom=2.75,
mapbox_style="stamen-terrain")
fig.show()
business_checkin = (business
.join(checkin, "business_id")
.withColumn("checkin_count", F.size(F.split(F.col("date"), ",")))
.select("business_id", "latitude", "longitude", "checkin_count")
.sort("checkin_count", ascending=False))
business_checkin.show()
[Stage 6:> (0 + 4) / 4]
+--------------------+-------------+---------------+-------------+ | business_id| latitude| longitude|checkin_count| +--------------------+-------------+---------------+-------------+ |7sxYa0-TwWeWGFr5C...|33.6407227072| -84.4276356697| 150005| |EqUqaLJxMDxIqvnma...| 45.58979| -122.595204| 75511| |2PxZ-fICnd432NJHe...|28.4278388741| -81.308182904| 63982| |UlndVD4tezU3FACjZ...| 30.202473| -97.666861| 54657| |vITc1UkWLvG9pVdaO...|42.3622379746| -71.0210756391| 54361| |cyBm7p3D2RAoGlDn-...| 28.417663| -81.581212| 25208| |NvAYQvjLxwGC-kcWz...| 28.374694| -81.549404| 24017| |4CxF8c3MB7VAdY8zF...|45.5226120114|-122.6730836114| 20291| |OPfgKOm_n-ajUo3qj...| 45.5230969| -122.681325| 19941| |m3DeBd0NSbAGDjUOk...|49.1942850587| -123.18076195| 18388| |BOFD5UmhHvCn_XH3_...| 28.370971| -81.519392| 17033| |C_k727ws82eMe9xtJ...| 39.9980574| -82.8838088| 16848| |OQ2oHkcWA8KNC1Lsv...| 45.5230858| -122.64168| 15946| |qeuJgUdcmL5yAweOs...| 45.5246501| -122.6818687| 15451| |Tjgte2R4e-3JWj4aA...| 28.3575294| -81.5582714| 15378| |zioLxtBc9THNS2TOn...| 45.504723| -122.632105| 14934| |5y2zZGIE2a4tuEJ5F...| 30.270557| -97.753177| 14153| |5xS76cdT7THG14H_W...| 28.359719| -81.591313| 14151| |PrsvO1rzkgg6qFizl...|42.3641813865| -71.0542184385| 13360| |o_L9Ss4boqq6ZEF9x...| 45.5289444| -122.6982801| 12202| +--------------------+-------------+---------------+-------------+ only showing top 20 rows
business_checkin_pd = business_checkin.toPandas()
fig = px.density_mapbox(business_checkin_pd, lat='latitude', lon='longitude', z='checkin_count', title="Check-In Density Heatmap", radius=10, zoom=2.75,
mapbox_style="stamen-terrain")
fig.show()
# group reviews and checkins
# get count for each business grouped by months
review_checkin = (review
.join(checkin, ["business_id", "date"], "full")
.withColumn("year", F.split(F.col("date"), "-")[0])
.withColumn("combined_count", F.size(F.split(F.col("date"), ",")))
.select("business_id", "year", "combined_count")
.groupBy(["business_id", "year"])
.sum("combined_count"))
ratings = (review
.withColumn("year", F.split(F.col("date"), "-")[0])
.select("business_id", "year", "stars")
.groupBy(["business_id", "year"])
.mean("stars"))
ratings_reviews = (review_checkin
.join(ratings, ["business_id", "year"], "full"))
ratings_reviews.sort(F.desc("sum(combined_count)")).show()
[Stage 69:=====================================================>(197 + 3) / 200]
+--------------------+----+-------------------+------------------+ | business_id|year|sum(combined_count)| avg(stars)| +--------------------+----+-------------------+------------------+ |7sxYa0-TwWeWGFr5C...|2010| 150157|3.0592105263157894| |EqUqaLJxMDxIqvnma...|2010| 75582| 4.450704225352113| |2PxZ-fICnd432NJHe...|2010| 64063|3.2839506172839505| |UlndVD4tezU3FACjZ...|2010| 54720| 4.111111111111111| |vITc1UkWLvG9pVdaO...|2010| 54437| 3.486842105263158| |cyBm7p3D2RAoGlDn-...|2010| 25267| 4.220338983050848| |NvAYQvjLxwGC-kcWz...|2010| 24095| 4.256410256410256| |4CxF8c3MB7VAdY8zF...|2010| 20795|3.7876984126984126| |OPfgKOm_n-ajUo3qj...|2010| 20176| 4.753191489361702| |m3DeBd0NSbAGDjUOk...|2010| 18432| 4.136363636363637| |BOFD5UmhHvCn_XH3_...|2010| 17060|3.7777777777777777| |C_k727ws82eMe9xtJ...|2010| 16875|3.4444444444444446| |OQ2oHkcWA8KNC1Lsv...|2010| 16146| 4.495| |qeuJgUdcmL5yAweOs...|2010| 15571|3.7916666666666665| |Tjgte2R4e-3JWj4aA...|2010| 15411|3.9393939393939394| |zioLxtBc9THNS2TOn...|2010| 15202| 3.955223880597015| |5y2zZGIE2a4tuEJ5F...|2010| 14265| 4.223214285714286| |5xS76cdT7THG14H_W...|2010| 14201| 4.16| |PrsvO1rzkgg6qFizl...|2010| 13687|4.0672782874617734| |o_L9Ss4boqq6ZEF9x...|2012| 12381|4.4636871508379885| +--------------------+----+-------------------+------------------+ only showing top 20 rows
windowval = (Window.partitionBy('business_id').orderBy('date')
.rangeBetween(Window.unboundedPreceding, 0))
review_count = (review
.withColumn('rev_count', F.count('stars').over(windowval))
.withColumn("month", F.split(F.col("date"), "-")[1])
.withColumn("year", F.split(F.col("date"), "-")[0])
.select('business_id', 'month', 'year', 'stars', 'rev_count')
.groupBy(["business_id", "month", "year"])
.sum("rev_count"))
review_count.show()
[Stage 70:======================================================> (51 + 1) / 52]
+--------------------+-----+----+--------------+ | business_id|month|year|sum(rev_count)| +--------------------+-----+----+--------------+ |-36nnCT71XE0InJXK...| 07|2014| 1| |-36nnCT71XE0InJXK...| 02|2015| 2| |-36nnCT71XE0InJXK...| 07|2015| 3| |-36nnCT71XE0InJXK...| 11|2015| 4| |-36nnCT71XE0InJXK...| 03|2016| 5| |-36nnCT71XE0InJXK...| 09|2016| 6| |-36nnCT71XE0InJXK...| 10|2016| 15| |-36nnCT71XE0InJXK...| 12|2016| 9| |-36nnCT71XE0InJXK...| 02|2017| 33| |-36nnCT71XE0InJXK...| 04|2017| 13| |-36nnCT71XE0InJXK...| 05|2017| 14| |-36nnCT71XE0InJXK...| 06|2017| 31| |-36nnCT71XE0InJXK...| 12|2017| 17| |-36nnCT71XE0InJXK...| 01|2018| 37| |-36nnCT71XE0InJXK...| 06|2018| 41| |-36nnCT71XE0InJXK...| 07|2018| 69| |-36nnCT71XE0InJXK...| 08|2018| 25| |-36nnCT71XE0InJXK...| 09|2018| 26| |-36nnCT71XE0InJXK...| 03|2019| 27| |-36nnCT71XE0InJXK...| 05|2019| 28| +--------------------+-----+----+--------------+ only showing top 20 rows
windowval = (Window.partitionBy('business_id').orderBy('date')
.rangeBetween(Window.unboundedPreceding, 0))
review_sum_stars = (review
.withColumn('running_sum', F.sum('stars').over(windowval))
.withColumn("month", F.split(F.col("date"), "-")[1])
.withColumn("year", F.split(F.col("date"), "-")[0])
.select('business_id', 'month', 'year', 'running_sum')
.groupBy(["business_id", "month", "year"])
.sum("running_sum"))
review_sum_stars.show()
[Stage 72:======================================================> (51 + 1) / 52]
+--------------------+-----+----+----------------+ | business_id|month|year|sum(running_sum)| +--------------------+-----+----+----------------+ |-36nnCT71XE0InJXK...| 07|2014| 1.0| |-36nnCT71XE0InJXK...| 02|2015| 2.0| |-36nnCT71XE0InJXK...| 07|2015| 7.0| |-36nnCT71XE0InJXK...| 11|2015| 8.0| |-36nnCT71XE0InJXK...| 03|2016| 9.0| |-36nnCT71XE0InJXK...| 09|2016| 14.0| |-36nnCT71XE0InJXK...| 10|2016| 35.0| |-36nnCT71XE0InJXK...| 12|2016| 21.0| |-36nnCT71XE0InJXK...| 02|2017| 69.0| |-36nnCT71XE0InJXK...| 04|2017| 29.0| |-36nnCT71XE0InJXK...| 05|2017| 31.0| |-36nnCT71XE0InJXK...| 06|2017| 65.0| |-36nnCT71XE0InJXK...| 12|2017| 34.0| |-36nnCT71XE0InJXK...| 01|2018| 75.0| |-36nnCT71XE0InJXK...| 06|2018| 86.0| |-36nnCT71XE0InJXK...| 07|2018| 142.0| |-36nnCT71XE0InJXK...| 08|2018| 50.0| |-36nnCT71XE0InJXK...| 09|2018| 51.0| |-36nnCT71XE0InJXK...| 03|2019| 52.0| |-36nnCT71XE0InJXK...| 05|2019| 53.0| +--------------------+-----+----+----------------+ only showing top 20 rows
review_count_sum = (review_count
.join(review_sum_stars, ['business_id','month', 'year'], 'full')
.select('business_id', 'month', 'year', 'sum(rev_count)', 'sum(running_sum)')
)
review_count_sum.show()
[Stage 76:====================================================> (193 + 7) / 200]
+--------------------+-----+----+--------------+----------------+ | business_id|month|year|sum(rev_count)|sum(running_sum)| +--------------------+-----+----+--------------+----------------+ |--2mEJ63SC_8_08_j...| 11|2016| 6| 14.0| |--UNNdnHRhsyFUbDg...| 08|2016| 7132| 31341.0| |--xmN4i0Hoqx0pPtr...| 11|2017| 44| 180.0| |-0F6zJ4parKu5WMoh...| 08|2016| 6| 26.0| |-0U6xz5cKY1PB_TJh...| 05|2015| 120| 401.0| |-0XME_3EJtOIyiTaY...| 11|2012| 6| 24.0| |-0iqnv7MjKrgh7Q7b...| 04|2017| 365| 1381.0| |-11j5tUL810_k3Soo...| 11|2019| 21| 68.0| |-1b5PAJFq9L-EXEzP...| 02|2014| 234| 543.0| |-1lP2wOmxWidM6Ge4...| 09|2020| 18| 58.0| |-2CqsOEUICBAkdyu7...| 05|2018| 1| 5.0| |-2MT2xutGx_6EeZHA...| 08|2013| 19| 95.0| |-2S-_MGNtX1FcRaTO...| 11|2016| 8| 26.0| |-2aMFgYft7rED0bst...| 08|2017| 3| 15.0| |-2joeHbqY9TayADes...| 01|2016| 195| 784.0| |-2joeHbqY9TayADes...| 05|2019| 302| 1232.0| |-2joeHbqY9TayADes...| 07|2018| 279| 1141.0| |-384bwLp3FoesKs7c...| 02|2018| 90| 447.0| |-3H3oRyCbLuGHVVdq...| 06|2014| 2| 10.0| |-3VQI-QXI7fr4TWH0...| 10|2016| 757| 2783.0| +--------------------+-----+----+--------------+----------------+ only showing top 20 rows
review_checkin = (review
.join(checkin, ["business_id", "date"], "full")
.withColumn("month", F.split(F.col("date"), "-")[1])
.withColumn("year", F.split(F.col("date"), "-")[0])
.withColumn("combined_count", F.size(F.split(F.col("date"), ",")))
.select("business_id", "month", "year", "combined_count")
.groupBy(["business_id", "month", "year"])
.sum("combined_count"))
review_checkin.sort("year", "month").show()
+--------------------+-----+----+-------------------+ | business_id|month|year|sum(combined_count)| +--------------------+-----+----+-------------------+ |F6Qsav77WqGhstQ1E...| 10|2004| 1| |uSM_sCp22BJWeatTD...| 10|2004| 1| |rdS7hBBeukiX4Led9...| 10|2004| 1| |RThn3_Y6qN8MsqZKW...| 10|2004| 1| |dr9PgB_kTp998OYWi...| 10|2004| 2| |_VkLbBvqBMmDJ9EUO...| 10|2004| 1| |Hnx34LxtfW4XV0gZ1...| 10|2004| 1| |Aigcg-qqohtSmjDci...| 10|2004| 2| |sf04_4kcxi6UkXMUl...| 10|2004| 1| |_irBz9yudNSlDUVAe...| 10|2004| 1| |GPhnm_rcjkbc_b96D...| 10|2004| 1| |VvhGHzjyOdStXblX8...| 10|2004| 1| |CGbuorlhKS88aKwhb...| 10|2004| 1| |0Ov-g-4eTOTSgVDBf...| 10|2004| 1| |pvlM--HZY1a8SqMXi...| 10|2004| 1| |54ElwAyN-o8e4uvOk...| 10|2004| 1| |w7JNKVadLeZoqfpTG...| 10|2004| 1| |bJLeoKF04fHYNoxpa...| 10|2004| 1| |j6PJEamZViym5thvz...| 10|2004| 1| |JdVlUaXM9aZQ68tjg...| 10|2004| 1| +--------------------+-----+----+-------------------+ only showing top 20 rows
enough_data = (review_checkin
.groupBy("business_id")
.count())
enough_data.show()
[Stage 85:=====================================================>(198 + 2) / 200]
+--------------------+-----+ | business_id|count| +--------------------+-----+ |dW4pD2rGTIhwCLCtP...| 85| |6qlWc-OVhTQfcyVi3...| 29| |j2NOygXjaTufrsKFl...| 58| |88kosdu9NbLZnHrio...| 85| |D2bmsnOkBIt72WYxO...| 28| |RhGnmMMEaNm3JwKKD...| 56| |euJD4G-BgXYY5BgFl...| 76| |kh9qwwqUcGwgexuiU...| 33| |FbZLY5XASP9phBySt...| 39| |q_Hw7TE8IXJtaQICG...| 8| |6KGBXOeSJYf9ePdyA...| 161| |8Lnid7N3bcAWY--CI...| 28| |PBfFmFy8C0WOtpHF_...| 148| |u7s0q_yGEihxnsS0m...| 27| |MhYtCVMzARhHpApK4...| 123| |8A_OhuJIr6k0Czdz4...| 36| |CeeO1uNYB0uvTNC6M...| 31| |3KqpiLDAjeeMmZeU-...| 52| |W60O4ast9uAq03n7n...| 153| |r7-YSA7Z8p4KISZ-j...| 132| +--------------------+-----+ only showing top 20 rows
enough_data = (enough_data
.where(enough_data['count'] >= 24))
enough_data.show()
[Stage 90:===================================================> (192 + 8) / 200]
+--------------------+-----+ | business_id|count| +--------------------+-----+ |dW4pD2rGTIhwCLCtP...| 85| |6qlWc-OVhTQfcyVi3...| 29| |j2NOygXjaTufrsKFl...| 58| |88kosdu9NbLZnHrio...| 85| |D2bmsnOkBIt72WYxO...| 28| |RhGnmMMEaNm3JwKKD...| 56| |euJD4G-BgXYY5BgFl...| 76| |kh9qwwqUcGwgexuiU...| 33| |FbZLY5XASP9phBySt...| 39| |6KGBXOeSJYf9ePdyA...| 161| |8Lnid7N3bcAWY--CI...| 28| |PBfFmFy8C0WOtpHF_...| 148| |u7s0q_yGEihxnsS0m...| 27| |MhYtCVMzARhHpApK4...| 123| |8A_OhuJIr6k0Czdz4...| 36| |CeeO1uNYB0uvTNC6M...| 31| |3KqpiLDAjeeMmZeU-...| 52| |W60O4ast9uAq03n7n...| 153| |r7-YSA7Z8p4KISZ-j...| 132| |Ukr4_xiBTUPOJ8ySP...| 38| +--------------------+-----+ only showing top 20 rows
final_df = (review_count_sum
.join(review_checkin, ['business_id', 'month', 'year'], 'inner')
.join(enough_data, 'business_id', 'inner'))
final_df.show()
[Stage 101:===================================================> (194 + 6) / 200]
+--------------------+-----+----+--------------+----------------+-------------------+-----+ | business_id|month|year|sum(rev_count)|sum(running_sum)|sum(combined_count)|count| +--------------------+-----+----+--------------+----------------+-------------------+-----+ |-36nnCT71XE0InJXK...| 05|2019| 28| 53.0| 1| 25| |-36nnCT71XE0InJXK...| 12|2017| 17| 34.0| 1| 25| |-36nnCT71XE0InJXK...| 09|2016| 6| 14.0| 1| 25| |-36nnCT71XE0InJXK...| 03|2016| 5| 9.0| 1| 25| |-36nnCT71XE0InJXK...| 07|2014| 1| 1.0| 1| 25| |-36nnCT71XE0InJXK...| 06|2017| 31| 65.0| 2| 25| |-36nnCT71XE0InJXK...| 03|2020| 30| 59.0| 1| 25| |-36nnCT71XE0InJXK...| 09|2019| 29| 58.0| 1| 25| |-36nnCT71XE0InJXK...| 01|2018| 37| 75.0| 2| 25| |-36nnCT71XE0InJXK...| 02|2017| 33| 69.0| 3| 25| |-36nnCT71XE0InJXK...| 07|2020| 65| 125.0| 2| 25| |-36nnCT71XE0InJXK...| 05|2017| 14| 31.0| 1| 25| |-36nnCT71XE0InJXK...| 11|2015| 4| 8.0| 1| 25| |-36nnCT71XE0InJXK...| 10|2016| 15| 35.0| 2| 25| |-36nnCT71XE0InJXK...| 08|2018| 25| 50.0| 1| 25| |-36nnCT71XE0InJXK...| 02|2015| 2| 2.0| 1| 25| |-36nnCT71XE0InJXK...| 06|2018| 41| 86.0| 2| 25| |-36nnCT71XE0InJXK...| 09|2018| 26| 51.0| 1| 25| |-36nnCT71XE0InJXK...| 03|2019| 27| 52.0| 1| 25| |-36nnCT71XE0InJXK...| 07|2015| 3| 7.0| 1| 25| +--------------------+-----+----+--------------+----------------+-------------------+-----+ only showing top 20 rows
x1 = final_df.select('sum(rev_count)').rdd.flatMap(lambda x: x).collect()
x2 = final_df.select('sum(running_sum)').rdd.flatMap(lambda x: x).collect()
y = final_df.select('sum(combined_count)').rdd.flatMap(lambda x: x).collect()
x1 = np.array(x1).reshape(-1,1)
x2 = np.array(x2).reshape(-1,1)
xx = [[i[0],j[0]] for i, j in zip(x1,x2)]
reg = LinearRegression().fit(xx, y)
# print(f"linear regression score: {reg.score(x, y)}")
print(f"linear regression score: {reg.score(xx, y)}")
print(f"coefficient: {reg.coef_}")
print(f"y-intercept: {reg.intercept_}")
linear regression score: 0.00013265360198433385 coefficient: [-5.83676050e-05 9.08684594e-05] y-intercept: 6.196444054201265
Based on reviewing the data and relating the different tables together, we found 3 different attempts at correlating reviews/checkins to business. All three attempts lead to a very low coefficient of determination or "score" which tells us that reviews and ratings have no impact on future business.